Fix ORC and JSON tests failures for pandas 2.2 #15062
python/cudf/cudf/tests/test_json.py
Outdated
tag == "dtype_mismatch", reason="int vs float mismatch"
)
)
assert_eq(expected, target)
Can pandas now handle comparing the equality of nested types/values? The last time I checked, pandas wasn't able to do that properly, so we resorted to using pyarrow for comparisons of nested types.
I was just able to triage the issue here. It looks like pd.read_json has a regression in pandas 2.2: the dataframe is expected to have a RangeIndex, but a materialized Index of int64 dtype is returned instead. We should just change this test to the following:
diff --git a/python/cudf/cudf/tests/test_json.py b/python/cudf/cudf/tests/test_json.py
index ec980adc33..5a459e98d1 100644
--- a/python/cudf/cudf/tests/test_json.py
+++ b/python/cudf/cudf/tests/test_json.py
@@ -1179,6 +1179,9 @@ class TestNestedJsonReaderCommon:
     def test_order_nested_json_reader(self, tag, data):
         expected = pd.read_json(StringIO(data), lines=True)
+        if PANDAS_GE_200:
+            # TODO: Remove after bug fix: <Pandas-bug-URL>
+            expected = expected.reset_index(drop=True)
         target = cudf.read_json(StringIO(data), lines=True)
         if tag == "dtype_mismatch":
             with pytest.raises(AssertionError):
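To illustrate why the reset_index(drop=True) call in the suggestion above helps, here is a minimal sketch in plain pandas. It does not reproduce the pd.read_json regression itself. It builds a frame with a materialized int64 Index by hand (an assumption made for illustration) and shows that resetting the index restores a default RangeIndex. The names df and normalized are hypothetical.

```python
import pandas as pd

# Hand-built stand-in for the frame pd.read_json is reported to return in
# pandas 2.2: same values, but the index is a materialized int64 Index
# instead of a RangeIndex. (Constructed manually; an assumption for the demo.)
df = pd.DataFrame({"a": [1, 2, 3]}, index=pd.Index([0, 1, 2], dtype="int64"))
print(type(df.index).__name__)

# reset_index(drop=True) discards the old index and installs a default
# RangeIndex, which is what the test expects on the pandas side.
normalized = df.reset_index(drop=True)
print(type(normalized.index).__name__)
```

Because drop=True discards the old index rather than adding it as a column, the data columns are left unchanged.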
Ah, I see. Yeah, nested data isn't well tested in pandas, so it's probably better to use pyarrow to compare here. Will incorporate your change.
Thanks @mroeschke !
/merge
Description
test_order_nested_json_reader was refactored to use assert_eq instead of comparing via pyarrow. This was failing in pandas 2.2 due to pandas-dev/pandas#57429.
test_orc_reader_trailing_nulls, I believe, was failing due to a change in how integers are compared with assert_series_equal: pandas-dev/pandas#55882. The "casting workaround" doesn't seem necessary in pandas 2.2, so this change avoids it altogether.
Checklist